Objectives

  • try out the basics of text analysis
  • load digital-born PDF data into R
  • learn what the concepts of tokenisation, word frequency, and wordclouds mean
  • learn about sentiment dictionaries
  • explore and evaluate different approaches to sentiment measurement
  • apply the toolkit to a digital text of your choice (English or Danish)

This episode is intended for intermediate learners of R who wish to explore sentiment analysis. You can follow and execute individual chunks in this R Markdown document and analyze the emotional loading of the IPCC (Intergovernmental Panel on Climate Change) Special Report on Global Warming of 1.5°C. Once you understand the digital workflow, you can analyze a digital text of your choice. Ask questions whenever chunks do not run or produce confusing output.

We start by loading and wrangling the IPCC text and introducing basic text-mining concepts. We then spend the bulk of the time demonstrating different kinds of sentiment measurement with R tools (tidytext), and visualize the results to assess the strengths and shortcomings of these approaches for different research tasks.

A fantastic resource on tools and concepts is Julia Silge and David Robinson’s Text Mining with R.

Another text that accompanies and further explains these concepts is Nina Tahmasebi and Simon Hengchen (2019), The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies, Samlaren.

Get your environment set up

# Load general libraries
library(tidyverse)
library(here)

# Load libraries for text mining:
library(pdftools)
library(tidytext)
library(textdata) 
library(ggwordcloud)

Note

For more text analysis code, you can fork and work through Casey O’Hara and Jessica Couture’s eco-data-sci workshop (available at https://github.com/oharac/text_workshop).

Get the IPCC report into R

ipcc_path <- here("data","ipcc_gw_15.pdf")
ipcc_text <- pdf_text(ipcc_path)

Some things to notice:

  • How cool to extract text out of a PDF! Do you think it will work with any PDF?
  • Each element is a page of the PDF (i.e., this is a character vector with one string per page)
  • The pdf_text() function only sees text that is “selectable”
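
To check the last point on a file of your own, one rough approach is to look for pages where no text came back at all. The sketch below uses a made-up character vector (`toy_pages`, a hypothetical stand-in for the output of pdf_text()) so that it runs without a PDF; a page that is only a scanned image would typically come back as an empty string.

```r
# Toy stand-in for pdf_text() output: one string per page (hypothetical data)
toy_pages <- c("Summary for Policymakers\n", "", "   \n", "Chapter 1\nFraming and context\n")

# Pages with no characters left after trimming white space
empty_pages <- which(nchar(trimws(toy_pages)) == 0)
empty_pages
```

On real data you would run the same check on ipcc_text; pages flagged this way probably need OCR rather than pdf_text().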

Example: Just want to get text from a single page (e.g. Page 9)?

ipcc_p9 <- ipcc_text[9]
ipcc_p9
[1] "                                                                                               Summary for Policymakers\n\n\n\n\nWe would also like to thank Abdalah Mokssit, Secretary of the IPCC, and the staff of the\nIPCC Secretariat: Kerstin Stendahl, Jonathan Lynn, Sophie Schlingemann, Judith Ewa, Mxolisi\nShongwe, Jesbin Baidya, Werani Zabula, Nina Peeva, Joelle Fernandez, Annie Courtin, Laura\nBiagioni and Oksana Ekzarho. Thanks are due to Elhousseine Gouaini who served as the                                          SPM\nconference officer for the 48th Session of the IPCC.\n\n\nFinally, our particular appreciation goes to the Working Group Technical Support Units\nwhose tireless dedication, professionalism and enthusiasm led the production of this\nSpecial Report. This report could not have been prepared without the commitment of\nmembers of the Working Group I Technical Support Unit, all new to the IPCC, who rose\nto the unprecedented Sixth Assessment Report challenge and were pivotal in all aspects\nof the preparation of the Report: Yang Chen, Sarah Connors, Melissa Gomis, Elisabeth\nLonnoy, Robin Matthews, Wilfran Moufouma-Okia, Clotilde Péan, Roz Pidcock, Anna Pirani,\nNicholas Reay, Tim Waterfield, and Xiao Zhou. Our warmest thanks go to the collegial and\ncollaborative support provided by Marlies Craig, Andrew Okem, Jan Petzold, Melinda Tignor\nand Nora Weyer from the WGII Technical Support Unit and Bhushan Kankal, Suvadip Neogi\nand Joana Portugal Pereira from the WGIII Technical Support Unit. A special thanks goes\nto Kenny Coventry, Harmen Gudde, Irene Lorenzoni, and Stuart Jenkins for their support\nwith the figures in the Summary for Policymakers, as well as Nigel Hawtin for graphical\nsupport of the Report. 
In addition, the following contributions are gratefully acknowledged:\nJatinder Padda (copy edit), Melissa Dawes (copy edit), Marilyn Anderson (index), Vincent\nGrégoire (layout) and Sarah le Rouzic (intern).\n\n\nThe Special Report website has been developed by Habitat 7, led by Jamie Herring, and\nthe report content has been prepared and managed for the website by Nicholas Reay and\nTim Waterfield. We gratefully acknowledge the UN Foundation for supporting the website\ndevelopment.\n\n\n\n\n                                                                                                                          5\n"

See how that compares to the text on page 9 of the PDF. What has the pdftools library added, and where?

Note

From Jessica and Casey’s text mining workshop: “pdf_text() returns a vector of strings, one for each page of the pdf. So we can mess with it in tidyverse style, let’s turn it into a dataframe, and keep track of the pages. Then we can use stringr::str_split() to break the pages up into individual lines. Each line of the pdf is concluded with a backslash-n, so split on this. We will also add a line number in addition to the page number.”

Wrangle the report in shape for analysis:

  • Split up pages into separate lines (separated by \n) using stringr::str_split()
  • Unnest into regular columns using tidyr::unnest()
  • Remove leading/trailing white space with stringr::str_trim()

ipcc_df <- data.frame(ipcc_text) %>% 
  mutate(text_full = str_split(ipcc_text, pattern = '\n')) %>% 
  unnest(text_full) %>% 
  mutate(text_full = str_trim(text_full)) 

# Why does '\n' work here, when you often see '\\n'? In an R string, '\n' is a single newline character, and the regular expression simply matches that literal character.
# '\\n' (the regular-expression escape \n) also matches a newline, so both patterns give the same result. Double backslashes are only required for escapes that R strings do not define themselves, e.g. '\\d' or '\\.'.

# More information: https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html

Now each line, on each page, is its own row, with extra starting & trailing spaces removed.
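
The quoted note mentions keeping track of page and line numbers, which the chunk above does not do. A minimal sketch of one way to add them is shown here on a toy vector (`toy_pages`, `page`, and `line` are names invented for this illustration, not part of the lesson data); it also drops the empty strings that splitting on a trailing \n leaves behind.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy stand-in for pdf_text() output: one string per page (hypothetical data)
toy_pages <- c("  First line \nSecond line\n", "Only line on page two\n")

toy_df <- tibble(page_text = toy_pages) %>%
  mutate(page = row_number()) %>%                              # page number = position in the vector
  mutate(text_full = str_split(page_text, pattern = "\n")) %>% # list-column of lines
  unnest(text_full) %>%                                        # one row per line
  mutate(text_full = str_trim(text_full)) %>%
  filter(text_full != "") %>%                                  # drop empty lines
  group_by(page) %>%
  mutate(line = row_number()) %>%                              # line number within each page
  ungroup() %>%
  select(page, line, text_full)

toy_df
```

With page and line columns in place, any token can later be traced back to where it appeared in the report.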

Get the tokens (individual words) in tidy format

Use tidytext::unnest_tokens() (which builds on the tokenizers package) to split a text column into tokens. We are interested in words, so that’s the token we’ll use:

ipcc_tokens <- ipcc_df %>% 
  unnest_tokens(word, text_full)
ipcc_tokens
# A tibble: 15,151 × 2
   ipcc_text                                                                         word 
   <chr>                                                                             <chr>
 1 "                Global warming of 1.5°C\n         An IPCC Special Report on the… glob…
 2 "                Global warming of 1.5°C\n         An IPCC Special Report on the… warm…
 3 "                Global warming of 1.5°C\n         An IPCC Special Report on the… of   
 4 "                Global warming of 1.5°C\n         An IPCC Special Report on the… 1.5  
 5 "                Global warming of 1.5°C\n         An IPCC Special Report on the… c    
 6 "                Global warming of 1.5°C\n         An IPCC Special Report on the… an   
 7 "                Global warming of 1.5°C\n         An IPCC Special Report on the… ipcc 
 8 "                Global warming of 1.5°C\n         An IPCC Special Report on the… spec…
 9 "                Global warming of 1.5°C\n         An IPCC Special Report on the… repo…
10 "                Global warming of 1.5°C\n         An IPCC Special Report on the… on   
# … with 15,141 more rows
# See how this differs from `ipcc_df`
# Each word has its own row!
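
To see what the tokenizer does to a single line, here is a small sketch on the report's title line alone (the object names `toy_line`, `toy_tokens`, and `toy_tokens_case` are made up for this example). By default unnest_tokens() lowercases the text and strips punctuation, which is why "1.5°C" ends up as the two tokens "1.5" and "c" in the output above.

```r
library(dplyr)
library(tidytext)

# One line of text; \u00b0 is the degree sign
toy_line <- tibble(text_full = "Global warming of 1.5\u00b0C")

# Default behaviour: lowercase, punctuation and symbols removed
toy_tokens <- toy_line %>%
  unnest_tokens(word, text_full)
toy_tokens$word

# Keep the original capitalisation instead
toy_tokens_case <- toy_line %>%
  unnest_tokens(word, text_full, to_lower = FALSE)
toy_tokens_case$word
```

Lowercasing matters later on: the sentiment lexicons are lowercase, so tokens with capital letters would not match when joined.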

Let’s count the words!

ipcc_wc <- ipcc_tokens %>% 
  count(word) %>% 
  arrange(-n)
ipcc_wc
# A tibble: 2,413 × 2
   word           n
   <chr>      <int>
 1 and          616
 2 the          505
 3 of           476
 4 to           407
 5 in           352
 6 c            283
 7 global       223
 8 confidence   213
 9 warming      188
10 for          174
# … with 2,403 more rows
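
The same count can be written a little more compactly, and raw counts are often easier to compare across texts once they are turned into proportions. The sketch below shows both on a toy token table (`toy_tokens` and the `prop` column are assumptions made for this illustration).

```r
library(dplyr)

# Toy tokens (hypothetical data)
toy_tokens <- tibble(word = c("the", "global", "the", "warming", "the", "global"))

toy_wc <- toy_tokens %>%
  count(word, sort = TRUE) %>%   # same result as count(word) %>% arrange(-n)
  mutate(prop = n / sum(n))      # share of all tokens

toy_wc
```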

OK…so we notice that a whole bunch of things show up frequently that we might not be interested in (“a”, “the”, “and”, etc.). These are called stop words. Let’s remove them.

Remove stop words:

See ?stop_words and View(stop_words) to look at the documentation and contents of the stop word lexicons.

We will remove stop words using dplyr::anti_join():

ipcc_stop <- ipcc_tokens %>% 
  anti_join(stop_words) %>% 
  select(-ipcc_text)
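
anti_join() keeps the rows of the left table that have no match in the right table. A minimal sketch on toy tokens (names such as `toy_tokens` and `snowball_only` are invented here) shows the join with an explicit `by = "word"`, which silences the "Joining with..." message, and how to restrict the removal to a single stop word lexicon.

```r
library(dplyr)
library(tidytext)  # provides the stop_words data frame

# Toy tokens (hypothetical data)
toy_tokens <- tibble(word = c("the", "global", "and", "warming"))

# Remove every word that appears in any of the stop word lexicons
toy_stop <- toy_tokens %>%
  anti_join(stop_words, by = "word")

# Remove only words from one lexicon, here "snowball"
snowball_only <- stop_words %>%
  filter(lexicon == "snowball")

toy_stop_snowball <- toy_tokens %>%
  anti_join(snowball_only, by = "word")

toy_stop
```

Which lexicon you pick changes what counts as a stop word, so it is worth comparing the results when a borderline word matters for your question.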

Now check the counts again:

ipcc_swc <- ipcc_stop %>% 
  count(word) %>% 
  arrange(-n)

What if we want to get rid of all the numbers (non-text) in ipcc_stop?

# This code filters out numbers by asking:
# If you convert a token with as.numeric(), is the result NA (meaning it is a word, not a number)?
# If it IS NA (is.na), keep it (so all words are kept)
# Anything that converts to a number is removed
# as.numeric() will warn "NAs introduced by coercion"; that is expected here

ipcc_no_numeric <- ipcc_stop %>% 
  filter(is.na(as.numeric(word)))
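
An alternative that avoids the coercion warnings is to describe "looks like a number" with a regular expression. This is only a sketch on toy tokens (`toy_words` is made up), and the pattern is an assumption about what should count as numeric: tokens made up entirely of digits, dots, and commas.

```r
library(dplyr)
library(stringr)

# Toy tokens (hypothetical data)
toy_words <- tibble(word = c("global", "1.5", "2", "warming", "2c", "1,000"))

# Drop tokens that consist only of digits, dots and commas
toy_no_numeric <- toy_words %>%
  filter(!str_detect(word, "^[0-9.,]+$"))

toy_no_numeric
```

Note the difference for a token like "1,000": as.numeric() cannot parse it and would keep it, while this pattern removes it. Mixed tokens such as "2c" survive either way.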

A word cloud of IPCC report words (non-numeric)

See more: https://cran.r-project.org/web/packages/ggwordcloud/vignettes/ggwordcloud.html

# There are almost 2000 unique words 
length(unique(ipcc_no_numeric$word))
[1] 1919
# We probably don't want to include them all in a word cloud. Let's keep only the 100 most frequent words.
ipcc_top100 <- ipcc_no_numeric %>% 
  count(word) %>% 
  arrange(-n) %>% 
  head(100)
ipcc_cloud <- ggplot(data = ipcc_top100, aes(label = word)) +
  geom_text_wordcloud() +
  theme_minimal()

ipcc_cloud

That’s underwhelming. Let’s customize it a bit:

ggplot(data = ipcc_top100, aes(label = word, size = n)) +
  geom_text_wordcloud_area(aes(color = n), shape = "diamond") +
  scale_size_area(max_size = 12) +
  scale_color_gradientn(colors = c("darkgreen","blue","red")) +
  theme_minimal()

Cool! And you can facet wrap (for different reports, for example) and update other aesthetics. See more here: https://cran.r-project.org/web/packages/ggwordcloud/vignettes/ggwordcloud.html

Sentiment analysis

First, check out the ‘sentiments’ lexicon. Julia Silge and David Robinson in their book say that:

“The three general-purpose lexicons are

  • AFINN from Finn Årup Nielsen,
  • bing from Bing Liu and collaborators, and
  • nrc from Saif Mohammad and Peter Turney

All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth. The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. All of this information is tabulated in the sentiments dataset, and tidytext provides a function get_sentiments() to get specific sentiment lexicons without the columns that are not used in that lexicon.”

Let’s explore the sentiment lexicons. “bing” is included in the tidytext library; for the other lexicons (“afinn”, “nrc”, “loughran”) you’ll be prompted to download them the first time you use them.

Watch out!

# Attach tidytext and textdata packages

# Uncomment the line below the first time you install the nrc dictionary 
# get_sentiments(lexicon = "nrc")
# When you get prompted to install lexicon - choose yes!


# Uncomment the line below the first time you install the afinn dictionary
# get_sentiments(lexicon = "afinn")
# When you get prompted to install lexicon - choose yes!

Note

WARNING: These collections include very offensive words. It’s best not to look at them in class.

afinn: words scored from -5 (very negative) to +5 (very positive). http://corpustext.com/reference/sentiment_afinn.html

get_sentiments(lexicon = "afinn")
# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# … with 2,467 more rows
# Note: may be prompted to download (yes)

# Let's look at the pretty positive words:
afinn_pos <- get_sentiments("afinn") %>% 
  filter(value %in% c(3,4,5))

# Do not look at negative words in class. 
afinn_pos
# A tibble: 222 × 2
   word         value
   <chr>        <dbl>
 1 admire           3
 2 admired          3
 3 admires          3
 4 admiring         3
 5 adorable         3
 6 adore            3
 7 adored           3
 8 adores           3
 9 affection        3
10 affectionate     3
# … with 212 more rows

bing: binary, “positive” or “negative” words. https://search.r-project.org/CRAN/refmans/textdata/html/lexicon_bing.html

get_sentiments(lexicon = "bing")
# A tibble: 6,786 × 2
   word        sentiment
   <chr>       <chr>    
 1 2-faces     negative 
 2 abnormal    negative 
 3 abolish     negative 
 4 abominable  negative 
 5 abominably  negative 
 6 abominate   negative 
 7 abomination negative 
 8 abort       negative 
 9 aborted     negative 
10 aborts      negative 
# … with 6,776 more rows

nrc: Includes bins for 8 emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and positive / negative.

https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm

get_sentiments(lexicon = "nrc")
# A tibble: 13,872 × 2
   word        sentiment
   <chr>       <chr>    
 1 abacus      trust    
 2 abandon     fear     
 3 abandon     negative 
 4 abandon     sadness  
 5 abandoned   anger    
 6 abandoned   fear     
 7 abandoned   negative 
 8 abandoned   sadness  
 9 abandonment anger    
10 abandonment fear     
# … with 13,862 more rows

Note

Citations for all the lexicons

Saif Mohammad and Peter Turney, “Crowdsourcing a Word-Emotion Association Lexicon.”, Computational Intelligence, 29 (3), 436-465, 2013.

Finn Årup Nielsen, “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.”, Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages, CEUR Workshop Proceedings 718, 93-98, May 2011. http://arxiv.org/abs/1103.2903.

Minqing Hu and Bing Liu, “Mining and summarizing customer reviews.”, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), 2004.

Let’s do sentiment analysis on the IPCC text data using the afinn and nrc lexicons.

Sentiment analysis with afinn

First, join the words in ipcc_stop to the afinn lexicon:

ipcc_afinn <- ipcc_stop %>% 
  inner_join(get_sentiments("afinn"))
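
If you want to see what this join does without downloading anything, here is a minimal sketch with a three-word lexicon (`toy_tokens` and `toy_afinn` are invented for this example; the three scores are the ones shown for these words in the AFINN output in this lesson). inner_join() keeps only tokens that appear in the lexicon and attaches their score; everything else is silently dropped.

```r
library(dplyr)

# Toy tokens (hypothetical data)
toy_tokens <- tibble(word = c("high", "confidence", "support", "abandon", "pathways"))

# Tiny AFINN-style lexicon; a stand-in for get_sentiments("afinn")
toy_afinn <- tibble(
  word  = c("confidence", "support", "abandon"),
  value = c(2, 2, -2)
)

toy_scored <- toy_tokens %>%
  inner_join(toy_afinn, by = "word")

toy_scored
mean(toy_scored$value)
```

Only three of the five tokens contribute to the mean, which is worth remembering whenever a single summary score is reported.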

Let’s find some counts (by sentiment ranking):

ipcc_afinn_hist <- ipcc_afinn %>% 
  count(value)

# Plot them: 
ggplot(data = ipcc_afinn_hist, aes(x = value, y = n)) +
  geom_col() +
  theme_bw()

Investigate some of the words in a bit more depth:

# What are these '2' words?
ipcc_afinn2 <- ipcc_afinn %>% 
  filter(value == 2)
# Check the unique 2-score words:
unique(ipcc_afinn2$word)
 [1] "strengthening" "support"       "inspired"      "integrity"     "sincere"      
 [6] "appreciation"  "generous"      "supported"     "commitment"    "confidence"   
[11] "determined"    "solid"         "supports"      "opportunities" "robust"       
[16] "growth"        "benefits"      "ability"       "comprehensive" "assets"       
[21] "importance"    "improved"      "effective"     "healthy"       "strong"       
[26] "strengthened"  "carefully"     "improving"     "clean"         "responsible"  
[31] "positive"      "strength"      "peace"         "justice"       "resolve"      
[36] "asset"         "secure"        "ambitious"     "innovative"    "strengthen"   
# Count & plot them
ipcc_afinn2_n <- ipcc_afinn2 %>% 
  count(word, sort = TRUE) %>% 
  mutate(word = fct_reorder(factor(word), n))


ggplot(data = ipcc_afinn2_n, aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  theme_bw()

# OK so what's the deal with confidence? And is it really "positive" in the emotion sense? 

Look back at the IPCC report, and search for “confidence.” Is it typically associated with emotion, or something else?

We learn something important from this example: simply matching words against a sentiment lexicon does not differentiate between different uses of a word. (Machine-learning methods can use context to tell them apart, but we won’t do that here.)
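
Short of machine learning, a cheap way to look at context is to tokenize into pairs of neighbouring words (bigrams) and see which words come right before “confidence”. The sketch below runs on three invented sentences (`toy_lines` is not text from the report), so treat it as an illustration of the technique only.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Invented example sentences (hypothetical data, not quoted from the report)
toy_lines <- tibble(text_full = c(
  "Warming is likely to continue (high confidence)",
  "Some risks are lower (medium confidence)",
  "Impacts increase with warming (high confidence)"
))

confidence_context <- toy_lines %>%
  unnest_tokens(bigram, text_full, token = "ngrams", n = 2) %>%  # pairs of adjacent words
  filter(str_detect(bigram, " confidence$")) %>%                 # pairs ending in "confidence"
  count(bigram, sort = TRUE)

confidence_context
```

Running the same steps on ipcc_df would show which qualifiers accompany “confidence” in the actual report, which helps to judge whether the word is carrying emotion at all.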

Or we can summarize sentiment for the report:

ipcc_summary <- ipcc_afinn %>% 
  summarize(
    mean_score = mean(value),
    median_score = median(value)
  )

The mean and median indicate slightly positive overall sentiments based on the AFINN lexicon.

Sentiment analysis with nrc

We can use the nrc lexicon to start “binning” words by the feelings they’re typically associated with. As above, we’ll use inner_join() to combine the IPCC non-stop-word tokens with the nrc lexicon:

ipcc_nrc <- ipcc_stop %>% 
  inner_join(get_sentiments("nrc"))

Wait, won’t that exclude some of the words in our text? YES! We should check which are excluded using anti_join():

ipcc_exclude <- ipcc_stop %>% 
  anti_join(get_sentiments("nrc"))

# View(ipcc_exclude)

# Count to find the most excluded:
ipcc_exclude_n <- ipcc_exclude %>% 
  count(word, sort = TRUE)

head(ipcc_exclude_n)
# A tibble: 6 × 2
  word         n
  <chr>    <int>
1 global     223
2 warming    188
3 1.5        169
4 pathways   111
5 chapter    103
6 2           95

Lesson: always check which words are EXCLUDED in sentiment analysis using a pre-built lexicon!
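
One way to put a number on this is the share of tokens that the lexicon covers at all. A minimal sketch on toy data follows (`toy_tokens`, `toy_lexicon`, and the column names are assumptions made for this illustration; the sentiment labels are placeholders).

```r
library(dplyr)

# Toy tokens and a toy lexicon (hypothetical data)
toy_tokens <- tibble(word = c("global", "warming", "confidence", "risk", "pathways"))
toy_lexicon <- tibble(
  word      = c("confidence", "risk"),
  sentiment = c("trust", "fear")
)

coverage <- toy_tokens %>%
  mutate(in_lexicon = word %in% toy_lexicon$word) %>%
  summarize(
    n_tokens      = n(),
    n_matched     = sum(in_lexicon),
    share_matched = mean(in_lexicon)   # proportion of tokens that receive a sentiment
  )

coverage
```

A low share means the sentiment results describe only a small slice of the text, however tidy the plots look.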

Now find some counts:

ipcc_nrc_n <- ipcc_nrc %>% 
  count(sentiment, sort = TRUE)

# And plot them:

ggplot(data = ipcc_nrc_n, aes(x = sentiment, y = n)) +
  geom_col()+
  theme_bw()

Or count by sentiment and word, then facet:

ipcc_nrc_n5 <- ipcc_nrc %>% 
  count(word, sentiment, sort = TRUE) %>% 
  group_by(sentiment) %>% 
  slice_max(n, n = 5) %>%   # top 5 words per sentiment (ties kept); replaces the superseded top_n()
  ungroup()

ipcc_nrc_gg <- ggplot(data = ipcc_nrc_n5, aes(x = reorder(word,n), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, ncol = 2, scales = "free") +
  coord_flip() +
  theme_minimal() +
  labs(x = "Word", y = "count")

# Show it
ipcc_nrc_gg

# Save it
ggsave(plot = ipcc_nrc_gg, 
       here("figures","ipcc_nrc_sentiment.png"), 
       height = 8, 
       width = 5)
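
One wrinkle with reorder(word, n) in a faceted plot: a word that appears under several sentiments gets a single position shared by all panels, so bars are not always sorted within each facet. tidytext offers reorder_within() and scale_x_reordered() for this case. The sketch below uses invented counts (`toy_counts` is hypothetical) to show the pattern.

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Invented counts; "confidence" appears under two sentiments (hypothetical data)
toy_counts <- tibble(
  word      = c("confidence", "risk", "confidence", "support"),
  sentiment = c("fear", "fear", "trust", "trust"),
  n         = c(3, 5, 3, 2)
)

# Order words by n separately within each sentiment
toy_ordered <- toy_counts %>%
  mutate(word_ordered = reorder_within(word, n, sentiment))

toy_gg <- ggplot(toy_ordered, aes(x = word_ordered, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  scale_x_reordered() +   # strips the internal suffix from the axis labels
  coord_flip() +
  theme_minimal() +
  labs(x = "Word", y = "count")

toy_gg
```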

Wait, so “confidence” is showing up in the NRC lexicon as “fear”? Let’s check:

conf <- get_sentiments(lexicon = "nrc") %>% 
  filter(word == "confidence")

# Yep, check it out:
conf
# A tibble: 4 × 2
  word       sentiment
  <chr>      <chr>    
1 confidence fear     
2 confidence joy      
3 confidence positive 
4 confidence trust    

Big picture takeaway

Sentiment analysis has serious limitations, and they depend on which existing lexicon you use. Think hard about your findings and about whether a lexicon makes sense for your study. Even without a lexicon, word counts and exploration alone can be useful!
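
One practical response is to remove domain-specific terms from the lexicon before scoring and to compare the results. The following is a minimal sketch on toy data, not a recommended recipe: `toy_tokens`, `toy_lexicon`, and `domain_terms` are invented names, and the score for “risk” is a placeholder chosen for the example.

```r
library(dplyr)

# Toy tokens (hypothetical data)
toy_tokens <- tibble(word = c("confidence", "confidence", "confidence", "risk", "support"))

# Tiny AFINN-style lexicon; the value for "risk" is a placeholder
toy_lexicon <- tibble(
  word  = c("confidence", "risk", "support"),
  value = c(2, -2, 2)
)

# Words we decide are technical terms in this corpus, not expressions of feeling
domain_terms <- tibble(word = "confidence")

score_all <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  summarize(mean_score = mean(value))

score_adjusted <- toy_tokens %>%
  inner_join(anti_join(toy_lexicon, domain_terms, by = "word"), by = "word") %>%
  summarize(mean_score = mean(value))

score_all
score_adjusted
```

If a single term moves the overall score this much, the score says more about vocabulary conventions than about sentiment.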

Your task

Choose one of the tasks below to practice your newly acquired sentiment analysis skills:

  1. Taking this script as a point of departure, apply sentiment analysis to Game of Thrones. You will find GOT.pdf in the data folder. What are the most common meaningful words, and what emotions do you expect will dominate this volume? Are there any terms as ambiguous as ‘confidence’ above?

  2. Choose an English text of your own and subject it to sentiment analysis. For example, you can use the Arabian Nights from lesson 08-text-analysis.Rmd

  3. Choose a Danish text of your preference and analyze it. Be aware that each language needs an appropriate sentiment dictionary. For Danish there is the ‘sentida’ package, available at https://github.com/Guscode/Sentida. The installation instructions are in the README; ask your instructors for clarification.

Credits

This tutorial is inspired by Allison Horst’s Advanced Statistics and Data Analysis.